Instructions

The following tutorial will let you reproduce the plots that we are going to create at the workshop using R.
Please read carefully and follow the steps. Wherever you see the Code icon on the right you can click on it to see the actual code used in that section.

Introduction

This tutorial will focus on analysing the updated data of the worldwide Novel Corona virus (COVID-19) pandemic.
There are several data sources available online. We will use the data collected from a range of official sources and hosted on the Our World in Data website (Mathieu et al. 2021).

Analyse Data in R

To run R and RStudio on Binder, click on this badge - Launch Rstudio Binder.

Start RStudio and create a new project named Workshop3 in a new folder (if you need a reminder ho to do it, check out Workshop1 Tutorial on BB).
Once RStudio restarts inside the project’s folder, create a new R script named Workshop3.R and 2 new folders, one named data for our input data and another named output for our plots.

Install Extra Packages

For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on your computers) to try and avoid having to download the entire tidyverse. In addition to the Tidyverse packages we’ve got to know in the previous workshop we will use the plotly package to create interactive plots, paletteer for custom color palettes, readxl to read MS-Excel file, scales to format large numbers, lubridate to better handle dates, glue to paste together strings, patchwork to include several plots in a single figure and a few others to assist in getting the data into shape.
To use these packages, we will introduce a package called pacman that will assist in loading the required packages and installing them if they’re not already installed. To install it we use the install.packages('pacman') command, please note that the package name need to be quoted and that we only need to perform it once, or when we want or need to update the package. Once the package was installed, we can load its functions using the library(pacman) command and then load/install all the other packages at once with p_load() function.

# install required packages - needed only once! (comment with a # after first use)
install.packages('pacman')
# load required packages
library(pacman)
p_load(tidyverse,  paletteer, glue, scales, plotly, lubridate, patchwork, visdat)

More information on installing and using R packages can be found in this tutorial.

Read Data

Now that we’ve got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.).

We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data directly from a file on the Our World in Data website into a variable of type data frame (table). If we don’t want to use external packages, we can use the read.csv() function from base R, which won’t automatically parse columns containing dates and in previous versions of R (<4.0) will slightly change the structure of the resulting data frame (all text columns will be converted into factors).
> Note that in this case, we need to specify the column types because the data contains a lot of missing values that interfere with the automatic parsing of the column types._

# read data from Our World in Data website
covid_data <- read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv", 
                       col_types = paste(c("c", "f", "c", "D", rep("d", 29), 
                                           "c", rep("d", 33)), collapse = ""))

Data Exploration

Let’s use built-in functions for a brief data exploration, such as head() to show the first 10 rows of the data and str() for the type of data in each column (see detailed information on each variable in the data repository on GitHub):

#explore the data frame
head(covid_data) # show first 10 rows of the data and type of variables
## # A tibble: 6 × 67
##   iso_code continent location date       total…¹ new_c…² new_c…³ total…⁴ new_d…⁵
##   <chr>    <fct>     <chr>    <date>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 AFG      Asia      Afghani… 2020-01-03      NA       0      NA      NA       0
## 2 AFG      Asia      Afghani… 2020-01-04      NA       0      NA      NA       0
## 3 AFG      Asia      Afghani… 2020-01-05      NA       0      NA      NA       0
## 4 AFG      Asia      Afghani… 2020-01-06      NA       0      NA      NA       0
## 5 AFG      Asia      Afghani… 2020-01-07      NA       0      NA      NA       0
## 6 AFG      Asia      Afghani… 2020-01-08      NA       0       0      NA       0
## # … with 58 more variables: new_deaths_smoothed <dbl>,
## #   total_cases_per_million <dbl>, new_cases_per_million <dbl>,
## #   new_cases_smoothed_per_million <dbl>, total_deaths_per_million <dbl>,
## #   new_deaths_per_million <dbl>, new_deaths_smoothed_per_million <dbl>,
## #   reproduction_rate <dbl>, icu_patients <dbl>,
## #   icu_patients_per_million <dbl>, hosp_patients <dbl>,
## #   hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>, …

Descriptive Statistics

We can also produce some descriptive statistics to better understand the data and the nature of each variable. The skim() function (from the skimr package) provides an alternative to the base summary() function and produces a summary of basic descriptive statistics, such as the mean, min, max, quantiles and even the SD and rate of missing data for continuous numerical values.

# summary of variables in my data
# summary(covid_data)
skim(covid_data)
Table 1: Data summary
Name covid_data
Number of rows 297112
Number of columns 67
_______________________
Column type frequency:
character 3
Date 1
factor 1
numeric 62
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
iso_code 0 1.00 3 8 0 255 0
location 0 1.00 4 32 0 255 0
tests_units 190324 0.36 13 15 0 4 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2020-01-01 2023-03-25 2021-08-14 1180

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
continent 0 1 FALSE 7 Afr: 66919, Eur: 64332, Asi: 58969, Nor: 48146

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_cases 35711 0.88 5367079.76 3.374066e+07 1.00 5999.00 58008.00 598581.00 7.610711e+08 ▇▁▁▁▁
new_cases 8437 0.97 11175.52 1.052843e+05 0.00 0.00 21.00 580.00 7.946912e+06 ▇▁▁▁▁
new_cases_smoothed 9701 0.97 11220.60 1.025490e+05 0.00 1.29 41.43 667.43 6.405128e+06 ▇▁▁▁▁
total_deaths 56125 0.81 77985.92 4.045377e+05 1.00 116.00 1163.00 10159.50 6.879664e+06 ▇▁▁▁▁
new_deaths 8371 0.97 99.56 6.124100e+02 0.00 0.00 0.00 6.00 2.000500e+04 ▇▁▁▁▁
new_deaths_smoothed 9601 0.97 99.95 6.029200e+02 0.00 0.00 0.29 7.14 1.458129e+04 ▇▁▁▁▁
total_cases_per_million 35711 0.88 82039.20 1.321067e+05 0.00 1797.13 18217.91 98122.61 7.295334e+05 ▇▁▁▁▁
new_cases_per_million 8437 0.97 168.43 1.144710e+03 0.00 0.00 3.09 75.72 2.288720e+05 ▇▁▁▁▁
new_cases_smoothed_per_million 9701 0.97 169.11 6.485400e+02 0.00 0.32 11.82 111.03 3.724178e+04 ▇▁▁▁▁
total_deaths_per_million 56125 0.81 783.10 1.031080e+03 0.00 47.09 312.94 1188.39 6.450830e+03 ▇▂▁▁▁
new_deaths_per_million 8371 0.97 1.06 4.780000e+00 0.00 0.00 0.00 0.48 6.036600e+02 ▇▁▁▁▁
new_deaths_smoothed_per_million 9601 0.97 1.06 2.970000e+00 0.00 0.00 0.04 0.82 1.486400e+02 ▇▁▁▁▁
reproduction_rate 112295 0.62 0.91 4.000000e-01 -0.07 0.72 0.95 1.14 5.870000e+00 ▇▃▁▁▁
icu_patients 262888 0.12 731.65 2.271270e+03 0.00 25.00 115.00 489.00 2.889100e+04 ▇▁▁▁▁
icu_patients_per_million 262888 0.12 17.40 2.367000e+01 0.00 3.09 7.97 21.81 1.806800e+02 ▇▁▁▁▁
hosp_patients 262382 0.12 4264.68 1.048947e+04 0.00 258.00 862.50 3397.75 1.544970e+05 ▇▁▁▁▁
hosp_patients_per_million 262382 0.12 143.92 1.592700e+02 0.00 39.00 91.51 189.33 1.526850e+03 ▇▁▁▁▁
weekly_icu_admissions 288158 0.03 380.44 5.495600e+02 0.00 31.00 147.00 505.00 4.838000e+03 ▇▁▁▁▁
weekly_icu_admissions_per_million 288158 0.03 11.54 1.433000e+01 0.00 2.70 6.33 14.99 2.249800e+02 ▇▁▁▁▁
weekly_hosp_admissions 276074 0.07 4606.02 1.154401e+04 0.00 279.00 964.50 4365.00 1.539770e+05 ▇▁▁▁▁
weekly_hosp_admissions_per_million 276074 0.07 93.35 9.171000e+01 0.00 29.93 68.72 126.35 7.078300e+02 ▇▂▁▁▁
total_tests 217725 0.27 21104573.94 8.409869e+07 0.00 364654.00 2067330.00 10248451.50 9.214000e+09 ▇▁▁▁▁
new_tests 221709 0.25 67285.41 2.477340e+05 1.00 2244.00 8783.00 37229.00 3.585563e+07 ▇▁▁▁▁
total_tests_per_thousand 217725 0.27 924.25 2.195430e+03 0.00 43.59 234.14 894.37 3.292583e+04 ▇▁▁▁▁
new_tests_per_thousand 221709 0.25 3.27 9.030000e+00 0.00 0.29 0.97 2.91 5.310600e+02 ▇▁▁▁▁
new_tests_smoothed 193147 0.35 142178.36 1.138215e+06 0.00 1486.00 6570.00 32205.00 1.476998e+07 ▇▁▁▁▁
new_tests_smoothed_per_thousand 193147 0.35 2.83 7.310000e+00 0.00 0.20 0.85 2.58 1.476000e+02 ▇▁▁▁▁
positive_rate 201185 0.32 0.10 1.200000e-01 0.00 0.02 0.06 0.14 1.000000e+00 ▇▁▁▁▁
tests_per_case 202764 0.32 2403.63 3.344366e+04 1.00 7.10 17.50 54.60 1.023632e+06 ▇▁▁▁▁
total_vaccinations 223668 0.25 352587778.14 1.361098e+09 0.00 1414636.50 10750208.50 75326708.50 1.334085e+10 ▇▁▁▁▁
people_vaccinated 226775 0.24 157789754.47 6.080700e+08 0.00 825370.00 5131934.00 36909579.00 5.566232e+09 ▇▁▁▁▁
people_fully_vaccinated 229093 0.23 139481626.81 5.502372e+08 1.00 714027.50 4651793.00 30598444.50 5.118345e+09 ▇▁▁▁▁
total_boosters 254869 0.14 83600781.13 3.028949e+08 1.00 273127.00 3504945.00 24221600.00 2.752028e+09 ▇▁▁▁▁
new_vaccinations 236424 0.20 853265.79 3.426199e+06 0.00 3264.00 28292.50 227965.00 4.967365e+07 ▇▁▁▁▁
new_vaccinations_smoothed 136327 0.54 338504.59 2.111189e+06 0.00 428.00 5062.00 39318.00 4.369208e+07 ▇▁▁▁▁
total_vaccinations_per_hundred 223668 0.25 112.70 8.332000e+01 0.00 33.35 110.66 180.66 4.064300e+02 ▇▆▅▁▁
people_vaccinated_per_hundred 226775 0.24 50.33 2.999000e+01 0.00 22.64 58.00 76.38 1.290700e+02 ▇▅▇▆▁
people_fully_vaccinated_per_hundred 229093 0.23 45.45 2.958000e+01 0.00 15.98 52.10 71.98 1.268900e+02 ▇▅▇▅▁
total_boosters_per_hundred 254869 0.14 31.84 2.951000e+01 0.00 2.99 28.28 55.12 1.504700e+02 ▇▅▂▁▁
new_vaccinations_smoothed_per_million 136327 0.54 2184.58 3.318720e+03 0.00 209.00 934.00 2954.00 1.171130e+05 ▇▁▁▁▁
new_people_vaccinated_smoothed 136219 0.54 125170.59 8.584109e+05 0.00 80.00 1246.00 12718.00 2.107121e+07 ▇▁▁▁▁
new_people_vaccinated_smoothed_per_hundred 136219 0.54 0.09 1.900000e-01 0.00 0.00 0.02 0.10 1.171000e+01 ▇▁▁▁▁
stringency_index 103918 0.65 43.52 2.437000e+01 0.00 23.15 43.52 62.50 1.000000e+02 ▆▆▇▆▂
population_density 45032 0.85 412.18 1.882120e+03 0.14 37.73 90.67 222.87 2.054677e+04 ▇▁▁▁▁
median_age 62651 0.79 30.51 9.080000e+00 15.10 22.20 29.70 38.70 4.820000e+01 ▇▆▇▆▆
aged_65_older 70860 0.76 8.70 6.090000e+00 1.14 3.53 6.38 13.93 2.705000e+01 ▇▃▂▂▁
aged_70_older 65003 0.78 5.50 4.140000e+00 0.53 2.08 3.87 8.64 1.849000e+01 ▇▃▂▂▁
gdp_per_capita 67341 0.77 19024.25 2.001259e+04 661.24 3823.19 12294.88 27216.44 1.169356e+05 ▇▂▁▁▁
extreme_poverty 149118 0.50 13.84 2.009000e+01 0.10 0.60 2.50 21.40 7.760000e+01 ▇▂▁▁▁
cardiovasc_death_rate 66921 0.77 264.25 1.209300e+02 79.37 171.28 244.66 331.43 7.244200e+02 ▇▇▃▁▁
diabetes_prevalence 55198 0.81 8.56 4.940000e+00 0.99 5.35 7.20 10.79 3.053000e+01 ▇▇▂▁▁
female_smokers 124437 0.58 10.79 1.078000e+01 0.10 1.90 6.30 19.30 4.400000e+01 ▇▂▂▁▁
male_smokers 126787 0.57 32.91 1.357000e+01 7.70 22.60 33.10 41.30 7.810000e+01 ▆▇▆▂▁
handwashing_facilities 184387 0.38 50.79 3.196000e+01 1.19 20.86 49.84 83.24 1.000000e+02 ▇▅▅▅▇
hospital_beds_per_thousand 93913 0.68 3.10 2.550000e+00 0.10 1.30 2.50 4.20 1.380000e+01 ▇▃▂▁▁
life_expectancy 23909 0.92 73.72 7.400000e+00 53.28 69.59 75.05 79.46 8.675000e+01 ▁▃▅▇▅
human_development_index 73985 0.75 0.72 1.500000e-01 0.39 0.60 0.74 0.83 9.600000e-01 ▂▅▅▇▆
population 0 1.00 128297446.46 6.603186e+08 47.00 449002.00 5882259.00 28301700.00 7.975105e+09 ▇▁▁▁▁
excess_mortality_cumulative_absolute 286907 0.03 46832.79 1.371560e+05 -37726.10 19.30 4342.60 31048.00 1.282260e+06 ▇▁▁▁▁
excess_mortality_cumulative 286907 0.03 9.49 1.310000e+01 -44.23 0.37 7.69 15.46 7.655000e+01 ▁▅▇▁▁
excess_mortality 286907 0.03 13.07 2.669000e+01 -95.92 -0.99 6.80 18.62 3.767700e+02 ▂▇▁▁▁
excess_mortality_cumulative_per_million 286907 0.03 1437.72 1.822880e+03 -1984.28 12.53 863.34 2344.06 1.032952e+04 ▇▇▃▁▁
# find extreme rows
covid_data %>% arrange(desc(total_cases))
## # A tibble: 297,112 × 67
##    iso_code continent locat…¹ date       total…² new_c…³ new_c…⁴ total…⁵ new_d…⁶
##    <chr>    <fct>     <chr>   <date>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 OWID_WRL ""        World   2023-03-21  7.61e8   33950  93977. 6879664     198
##  2 OWID_WRL ""        World   2023-03-20  7.61e8   54418 101010. 6879466     256
##  3 OWID_WRL ""        World   2023-03-19  7.61e8   80448 104900. 6879210     325
##  4 OWID_WRL ""        World   2023-03-18  7.61e8  216329 117118. 6878885    2478
##  5 OWID_WRL ""        World   2023-03-17  7.61e8   76049 124942. 6876407     294
##  6 OWID_WRL ""        World   2023-03-16  7.61e8   94307 127202. 6876113     380
##  7 OWID_WRL ""        World   2023-03-15  7.61e8  102340 127761. 6875733     407
##  8 OWID_WRL ""        World   2023-03-14  7.60e8   83178 127969. 6875326     542
##  9 OWID_WRL ""        World   2023-03-13  7.60e8   81651 127906. 6874784     471
## 10 OWID_WRL ""        World   2023-03-12  7.60e8  165974 128449. 6874313     897
## # … with 297,102 more rows, 58 more variables: new_deaths_smoothed <dbl>,
## #   total_cases_per_million <dbl>, new_cases_per_million <dbl>,
## #   new_cases_smoothed_per_million <dbl>, total_deaths_per_million <dbl>,
## #   new_deaths_per_million <dbl>, new_deaths_smoothed_per_million <dbl>,
## #   reproduction_rate <dbl>, icu_patients <dbl>,
## #   icu_patients_per_million <dbl>, hosp_patients <dbl>,
## #   hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>, …

What are the metadata columns that describe our observations?

continent 
location  
date

Why do we have observations with missing continent?

# check which location have continent as NA
covid_data %>% filter(continent=="" | is.na(continent)) %>% count(location)
## # A tibble: 12 × 2
##    location                n
##    <chr>               <int>
##  1 Africa               1174
##  2 Asia                 1178
##  3 Europe               1178
##  4 European Union       1178
##  5 High income          1178
##  6 Low income           1174
##  7 Lower middle income  1178
##  8 North America        1177
##  9 Oceania              1174
## 10 South America        1177
## 11 Upper middle income  1178
## 12 World                1178
Some rows contain summarised data of entire continents/World, we'll need to remove those

We can see that most of our data contains ‘0’ (check the difference between the median and the mean in total_cases and total_deaths columns). Just to confirm that, let’s plot a histogram of all the confirmed cases

ggplot(covid_data, aes(x=total_cases)) +
  geom_histogram(fill="lightskyblue") +
  theme_bw(def_text_size)

The data is evolving over days (a time-series), to there’s no point treating it as a random population sample.

Time-series plot

Let’s look at confirmed cases and total deaths data for the 10 most affected countries (to date). To find out these countries so we need to wrangle our data a little bit using the following steps:

  1. First we remove all observations for combined continents with filter(!is.na(continent)
  2. Then we group it by location with group_by()
  3. Then we sort it within each location by date (from latest to earliest) with arrange(desc(date))
  4. We select just the most recent data point from each location with slice(1) and remove grouping with ungroup()
  5. Next we arrange it by descending order of total cases and select the top 10 observations (one for each location)
  6. Finally, we subset our original data to contain just the countries from our vector with inner_join()

Optional step:

  1. We can recode the location variable as a factor and order it so the countries will be ordered in the legend by the number of cases

Then we can look at the data as a table and make a plot with the number of cases in the y-axis and date in the x-axis.

# find the 10 most affected countries (to date)
latest_data <- covid_data %>% filter(!is.na(continent), continent!="") %>% 
  group_by(location) %>% arrange(desc(date)) %>% slice(1) %>% ungroup() 
most_affected_countries <- latest_data  %>%  
  arrange(desc(total_cases)) %>% slice(1:10) %>% 
  select(location)

# subset just the data from the 10 most affected countries and order them from the most affected to the least one
most_affected_data <- covid_data %>% 
  inner_join(most_affected_countries) %>% 
  mutate(Country=factor(location, levels = most_affected_countries$location))

# create a line plot the data of total cases
ggplot(most_affected_data, aes(x=date, y=total_cases, colour=Country)) +
  geom_line(size=0.75) + scale_y_continuous(labels=comma) + 
  scale_color_paletteer_d("rcartocolor::Bold") +
  labs(color="Country", y = "Total COVID-19 cases") +
  theme_bw(def_text_size)

We can look at the exponential increase at the early days of the pandemic if we transform the y-axis to logarithmic scale (and also improve how of the dates appear in the x-axis).

# better formatting of date axis, log scale 
plot <- ggplot(most_affected_data, aes(x=date, y=total_cases, colour=Country)) +
  geom_line(size=0.75) + scale_y_log10(labels=comma) + 
  scale_x_date(NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  scale_color_paletteer_d("rcartocolor::Bold") +
  labs(color="Country", y = "Total COVID-19 cases") +
  theme_bw(def_text_size)
# show an interactive plot
ggplotly(plot)

Why did we get a warning message and why the graphs don’t start at the bottom of the x-axis? How can we solve it? What can we infer from the graph (exponential increase)?

What happens when we take the log of 0?? Can we remove those 0s with the `filter()` function (or add a very small number to them)?
We can see a very similar trend for most countries and while the curve has flattened substantially in 2021, the numbers are still rising. It is also evident that Europe got hit by a second wave arount October 2020 and then again around Feb 2022 (with a huge rise in cases in South Korea).

Vaccination rates

Let’s have a look at the number of vaccinated people.

# vaccination rates
ggplot(most_affected_data %>% filter(date>dmy("01-01-2021")), aes(x=date, y=people_vaccinated, colour=Country)) +
  geom_line(size=0.75) + scale_y_continuous(labels=comma) + 
  scale_color_paletteer_d("rcartocolor::Bold") +
  scale_x_date(NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", y="People Vaccinated") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())

The graphs are “broken”, meaning that it is not continuous and we have some missing data.
Let’s visualise some of the variables in our data and assess “missingness”.

# visualise missingness
vis_dat(covid_data %>% filter(date>dmy("01-01-2022")) %>% 
          select(continent, location, total_cases, total_deaths, 
                 hosp_patients, people_vaccinated, people_fully_vaccinated))

# find which countries has the most number of observations (least missing data)
# covid_data %>% filter(!is.na(continent), continent!="", !is.na(people_vaccinated)) %>% # group_by(location) %>% 
#   count(location) %>% arrange(desc(n)) %>% print(n=30)
# 
# covid_data %>% filter(!is.na(continent), continent!="", !is.na(hosp_patients)) %>% # group_by(location) %>% 
#   count(location) %>% arrange(desc(n)) %>% print(n=30)

Hospitalisation and Vaccination rates

Now we can focus on a subset of countries that have more complete vaccination and hospitalisation rates data, so we could compare them.

countries_subset <- c("Italy", "Canada", "Israel", "United Kingdom",  "France", "Australia", 
                      "Malaysia", 'South Africa', "India")
# subset our original data to these countries
hosp_data <- covid_data %>% filter(location %in% countries_subset)
# define a new colour palette for these countries
col_pal <- "ggsci::category10_d3"

Let’s look at hospitalisation rates first.

ggplot(hosp_data, aes(x=date, y = hosp_patients,colour=location)) +
  geom_line(size=0.75) + scale_y_continuous(labels=comma) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(name = NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", y = "Hospitalised patients") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())

Can you identify the “waves” in each country?
It’s hard to see the details in the countries with lower number of hospitalised patients, how can we improve the visualisation?

Look at hospitalision rates proportional to the population size!
# hosp per population size
p1 <- ggplot(hosp_data, 
       aes(x=date, y = hosp_patients_per_million,colour=location)) +
  geom_line(size=0.75) + 
  scale_y_continuous(labels=comma) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(name = NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", y = "Hospitalised patients (per million)") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())
p1

Now let’s try to compare this to vaccination rates

# total vaccination per population
p2 <- ggplot(hosp_data, 
             aes(x=date, y = people_fully_vaccinated_per_hundred/100 ,colour=location)) +
  geom_line(size=0.75, linetype="dashed") + 
  guides(color = guide_legend(override.aes = list(linetype="solid") ) ) +
  scale_y_continuous(labels=percent) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(name = NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", y = "Rate of fully vaccinated people") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())
p2

What will be the best way to compare these values? Maybe like this (note that this syntax is used by patchwork package to arrange the plots):

(p1 + guides(color="none")) / (p2 + theme(legend.position = "bottom")) #+ plot_layout(guides = 'collect')# show graphs on top of each other

Any other suggestions? Let’s try to present them on the same graph (note the trick with the secondary y-axis), but for each country separately (done with the facet_wrap() function)..

# show on the same graph
p4 <- ggplot(hosp_data, 
       aes(x=date, colour=location)) +
  geom_line(aes(y = hosp_patients_per_million), size=0.75) + 
  geom_line(aes(y = people_fully_vaccinated_per_hundred*10), size=0.75, linetype="dashed") + 
  scale_y_continuous(name = "Hospitalised patients per million (solid)",
                     # Add a second axis and specify its features
                     sec.axis = sec_axis(trans=~./10,  
                                         name="Fully vaccinated per hundred (dashed)")) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(NULL,
               breaks = breaks_width("6 months"), 
               labels = label_date_short()) + 
  labs(color="Country") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank()) + 
  facet_wrap(~location)
p4

Save the plot to a file.

# create output folder
dir.create("./output", showWarnings = FALSE)
# save the plot to pdf file
ggsave("output/hospit_vacc_rates_facet_country.pdf", width=14, height = 8)

Questions

  1. What other variables we could analyse?
  2. Any correlated variables?
  3. What we should take into account that might bias the results or the true status of the pandemic?
1. Mortalities (Case Fatality Rate)?  
2. Suggestions? (cases per population density, vaccination rates by country income, deaths by number of beds per capita, etc.
3. Level of reporting in each country...  

Additional Resources

Now take the plunge and start practicing with your own data!!

tweet_embed("https://twitter.com/YannisMarkonis/status/1276211134186602499")

Contact

Please contact me at i.bar@griffith.edu.au for any questions or comments.

References

Mathieu E, Ritchie H, Ortiz-Ospina E, et al. (2021) A global database of COVID-19 vaccinations. Nat Hum Behav 5:947–953. doi: 10.1038/s41562-021-01122-8